Abstract

A selective sweep describes the reduction of diversity due to strong positive selection. If
the mutation rate to a selectively beneficial allele is sufficiently high, Pennings and Hermisson
(2006a) have shown that it becomes likely that a selective sweep is caused by several individuals.
Such an event is called a soft sweep, and the complementary event of a single origin
of the beneficial allele, the classical case, a hard sweep. We give analytical expressions for the
linkage disequilibrium (LD) between two neutral loci linked to the selected locus, depending
on the recurrent mutation to the beneficial allele, measured by $D$ and $\sigma_D^2$, a quantity introduced
by Ohta and Kimura (1969), and conclude that the LD pattern of a soft sweep differs
substantially from that of a hard sweep due to haplotype structure. We compare our results
with simulations.
1 Introduction

A long-standing question in evolutionary biology concerns the relative importance of
evolutionary factors such as selection and genetic drift in shaping patterns of diversity. Today
this topic is studied using DNA variation data taken from a sample of a population. Important
work on the effects of positive selection on patterns in DNA data was done by Maynard Smith
and Haigh (1974). They showed that neutral variation linked to a beneficial allele also increases
in frequency. This is called the hitchhiking effect, and the resulting reduction of neutral variation
is termed a selective sweep. When a beneficial allele fixes in a population, this allele can have a
single or several origins, i.e. it can be brought into the population by a single or by several mutants.
If several individuals, called founders, account for the fixation of a beneficial allele, we follow
Pennings and Hermisson and speak of a soft sweep, and otherwise of a hard sweep.

There are various reasons for a soft sweep. Adaptation can occur from recurrent migration, 
mutation or act on standing genetic variation. We treat here the case of recurrent mutation, which 
also applies to migration in a special case. Realistic models for recurrent migration in general can 
lead to more complex scenarios due to population structure. 



It has been shown by Hermisson and Pennings (Hermisson and Pennings 2005; Pennings and
Hermisson 2006a) that soft sweep events become frequent if the scaled mutation rate $\theta_s = 4Nu_s$
(where $N$ is the diploid population size and $u_s$ the mutation probability to the beneficial allele
per individual per generation) is sufficiently high. While hard sweeps dominate for $\theta_s < 0.01$,
both hard sweeps and soft sweeps occur in the range $0.01 < \theta_s < 1$. For $\theta_s > 1$ almost all
adaptive substitutions will result in soft sweeps. Soft sweeps become likely for populations with
large population sizes $N$ or for alleles with high recurrent mutation rates $u_s$. For example, most
pathogens have extremely large population sizes. Therefore their genomes are good candidates for
the detection of soft selective sweeps; see e.g. Nair et al. (2007) for research on soft sweeps in
malaria parasites. Karasov et al. (2010) recently concluded that in Drosophila melanogaster a
large number of soft sweeps should exist due to very large short-term effective population
sizes relevant for adaptation. Schlenke and Begun (2005) located some of these regions. Recent
research by Scheinfeldt et al. (2009) shows that the DNA pattern around the human gene ALMS1,
which causes Alström syndrome, presenting with early childhood obesity and insulin resistance
leading to type 2 diabetes, may also be the result of a soft sweep. Furthermore, Tishkoff et al. (2007)
found that different SNPs in the human genome, all lying in the same short genome region
of 110 bp, are responsible for human lactase persistence in African and European
populations. In their studies of LD (measured by the $D'$ value and the LOD score) the pattern
of a soft sweep can be recognized. Ongoing research argues for the importance of soft sweeps and
polygenic adaptation; for a review of this discussion see e.g. Pritchard et al. (2010).

Figure 1: The two possible geometries of the selected locus ($S$) and the two neutral loci ($L$ and
$R$). The scaled recombination rates between the two loci are given by $\rho_{SL}$, $\rho_{LR}$, $\rho_{LS}$ and $\rho_{SR}$.

In order to detect soft sweeps it is important to understand the footprints they leave in DNA
data. For this purpose statistical predictions are needed that allow us
to find targets of recent positive selection. Pennings and Hermisson (2006a) showed that tests
based on haplotype structure have high power to detect soft sweeps. If a soft sweep occurred, the
population can be divided into several haplotype groups, one for each founder. Without mutation
and recombination during the sweep, the genomes of the groups differ at the same loci at which the
founders differed at the beginning of the sweep. In particular, in the case of two founders each allelic
variant of a SNP locus is always linked to a single haplotype group. So high linkage disequilibrium
between two neutral loci in a neighborhood of the selected locus should be found. This gives rise to the
conjecture that linkage disequilibrium is a useful quantity to detect soft sweeps.

LD has been computed under neutrality by Ohta and Kimura (1969). Stephan et al. (2006),
McVean (2007) and Pfaffelhuber et al. (2008) gave analytical expressions for measures of LD after
a hard selective sweep. Kim and Nielsen (2004) developed a composite-likelihood method for
detecting hard sweeps incorporating information from measures of linkage disequilibrium based
on simulation studies. The aim of this article is to give analytical expressions for linkage disequilibrium
under a selective sweep with recurrent mutation to the beneficial allele (see Theorem 3.1
and Theorem 3.2). To determine the linkage disequilibrium we use an extended star-like approximation
for the genealogy of the selected site, see Section 2.4. A similar approach was applied by
Pfaffelhuber et al. (2008) to obtain the linkage disequilibrium in the case of a hard sweep. We will
see that soft sweeps produce a different signal than hard sweeps. In Section 4 we compare our
computations with simulations.



2 Model and measures of linkage disequilibrium 

2.1 The frequency of the selected locus during a sweep 

We consider a DNA region of a population of $N$ diploid individuals and concentrate on the neighborhood
of a bi-allelic selected locus $S$ with a wild-type allele $b$ and beneficial allele $B$. The new
beneficial allele $B$ with fitness advantage $s$ enters the population recurrently by mutation and
is assumed to fix eventually. The population reproduces at the beneficial locus according to the
Moran model in continuous time with selection and recurrent mutation to the beneficial allele,
i.e. denoting by $(X^N(t))_{t \ge 0}$ the frequency of the individuals carrying the beneficial allele in a
population of size $N$, $(X^N(t))_{t \ge 0}$ is a jump Markov process with transitions from

$i/2N$ to $(i+1)/2N$ at rate $u_s(2N - i) + (2N - i)\frac{i}{2N}(1 + s)$, (1)
$i/2N$ to $(i-1)/2N$ at rate $i\,\frac{2N - i}{2N}$, (2)

with $u_s, s > 0$. The rate $u_s(2N - i)$ is the mutation rate; the rates $(2N - i)\frac{i}{2N}(1 + s)$ and $i\,\frac{2N - i}{2N}$
are resampling rates which change the frequency of the beneficial allele by $+\frac{1}{2N}$ and
$-\frac{1}{2N}$, respectively. Of course, resampling events inside the beneficial and the wild-type background,
which do not change the frequency of the beneficial allele, are also possible.

The frequency of the beneficial allele can be approximated for large N by a differential equation: 



Proposition 2.1. Denote by $X^N(t)$ the frequency of the beneficial allele in the Moran model with
constant diploid population size $N$ at time $t$. Let $X^N(0) \to \epsilon$ as $N \to \infty$. Then the frequency of the
beneficial allele $X^N(t)$ converges for $N \to \infty$ to the solution of the differential equation

$$\dot X(t) = u_s(1 - X(t)) + sX(t)(1 - X(t)) \qquad (3)$$

with initial condition

$$X(0) = \epsilon,$$

in the sense that for all $\delta > 0$ and all $t \ge 0$

$$\lim_{N \to \infty} P\Big(\sup_{s \le t} |X^N(s) - X(s)| > \delta\Big) = 0.$$

The proof is an easy application of Theorem 3.1 in (Kurtz 1971).
Equation (3) has the solution

$$X(t) = \frac{(\epsilon s + u_s)\,s\,e^{st} - (s - \epsilon s)\,u_s\,e^{-u_s t}}{s\big((\epsilon s + u_s)e^{st} + (s - \epsilon s)e^{-u_s t}\big)}. \qquad (4)$$

In this approximation we say that the allele fixes in the population if $X(T) = 1 - \epsilon$. For the
above equation, this happens at time

$$T = \frac{1}{s + u_s}\log\Big(\frac{(u_s + s(1 - \epsilon))(1 - \epsilon)}{\epsilon(u_s + s\epsilon)}\Big).$$

With $\epsilon = 1/\alpha$, we obtain

$$T = \frac{1}{s + u_s}\log\Big(\frac{(\alpha - 1)(\alpha - 1 + \theta_s/2)}{1 + \theta_s/2}\Big),$$

denoting $\alpha := 2Ns$ and $\theta_s := 4Nu_s$. For small $u_s \ll s$ and
large $N$, such that $\alpha \gg 1$, the fixation time $T$ is approximately $2\log(\alpha)/s$. The fixation time will
be relevant below.
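The closed form (4) and the fixation time can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names and the parameter values ($N = 10^4$, $s = 0.1$, $\theta_s = 0.4$) are assumptions chosen only for illustration.

```python
import math

def beneficial_frequency(t, s, u_s, eps):
    """Closed-form solution X(t) of dX/dt = (u_s + s*X)*(1 - X) with X(0) = eps."""
    grow = (eps * s + u_s) * math.exp(s * t)
    decay = (s - eps * s) * math.exp(-u_s * t)
    return (s * grow - u_s * decay) / (s * (grow + decay))

def fixation_time(s, u_s, eps):
    """Time T with X(T) = 1 - eps, obtained by inverting the closed form."""
    ratio = (u_s + s * (1 - eps)) * (1 - eps) / (eps * (u_s + s * eps))
    return math.log(ratio) / (s + u_s)

# Hypothetical parameter values, chosen only for illustration.
N, s, theta_s = 10_000, 0.1, 0.4
u_s = theta_s / (4 * N)      # theta_s = 4 N u_s
alpha = 2 * N * s            # alpha = 2 N s
eps = 1 / alpha              # initial frequency epsilon = 1/alpha

T = fixation_time(s, u_s, eps)
T_approx = 2 * math.log(alpha) / s   # approximation for u_s << s, alpha >> 1

print(f"T = {T:.2f}, 2*log(alpha)/s = {T_approx:.2f}")
print(f"X(0) = {beneficial_frequency(0.0, s, u_s, eps):.6f}")
print(f"X(T) = {beneficial_frequency(T, s, u_s, eps):.6f}")
```

For these values the exact fixation time and the approximation $2\log(\alpha)/s$ differ by roughly one percent, which illustrates why the simpler expression is used in what follows.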



2.2 Measures of linkage disequilibrium 

Our aim is to provide analytical results for the linkage disequilibrium of two neutral loci in a
neighborhood of the selected locus. Different quantities have been proposed to measure the association
of two loci. We will compute two of them here. Consider two neutral loci $L$ and $R$, linked
to the selected locus $S$. The neutral loci can either both lie on the same side of the selected locus,
or the selected locus lies between the neutral loci, see Figure 1. (If both neutral loci lie on the left
side of the selected locus, we name the leftmost locus the $R$-locus and the locus in the middle the $L$-locus,
i.e. we have the ordering $R$ $L$ $S$.) We consider only loci with exactly two allelic variants. Denote
them by $L/\ell$ and $R/r$ and their allelic frequencies by $q_\ell$, $q_L$, etc.

Definition 2.2 ($D_{\ell,r}$ and $\hat D_{\ell,r}$). The simplest approach to measure linkage disequilibrium between
the allele $\ell$ of the $L$-locus and the allele $r$ of the $R$-locus is to compute the quantity

$$D_{\ell,r} := q_{\ell r} - q_\ell q_r. \qquad (5)$$

If $D_{\ell,r}$ is zero, the alleles $\ell$ and $r$ are said to be in linkage equilibrium, otherwise in linkage disequilibrium.

In practice, the population frequencies $q_\ell, q_r$, etc. are often not available, but only the allelic
frequencies in a sample, $\hat q_\ell, \hat q_r$, etc. In samples LD can be measured by

$$\hat D_{\ell,r} := \hat q_{\ell r} - \hat q_\ell \hat q_r.$$

Remark 2.3. An easy calculation shows that

$$D_{\ell,r} = D_{L,R} = -D_{\ell,R} = -D_{L,r},$$

and analogous equalities hold for $\hat D$.



Averaging $D_{\ell,r}$ over all allelic variants gives zero due to Remark 2.3. Hence it makes sense to
consider $D_{\ell,r}^2$. Since $D_{\ell,r}^2 = D_{L,R}^2 = D_{\ell,R}^2 = D_{L,r}^2$, the quantity $D_{\ell,r}^2$ actually does not depend on
the allelic variant $\ell, r$. Therefore we write $D^2$ instead. However, $D^2$ depends strongly on the
frequencies of the allelic variants: if $q_\ell$ and $q_r$ are small, $D^2$ is also small, regardless of how strongly
the allelic variants are associated. Therefore the so-called standard linkage disequilibrium, introduced
by Ohta and Kimura (1969), is often considered.



Definition 2.4 ($\sigma_D^2$ and $\hat\sigma_D^2$). The standard linkage disequilibrium $\sigma_D^2$, and $\hat\sigma_D^2$ in the sample,
respectively, is given by

$$\sigma_D^2 := \frac{E[D_{\ell,r}^2]}{E[q_\ell(1 - q_\ell)q_r(1 - q_r)]}$$

and

$$\hat\sigma_D^2 := \frac{E[\hat D_{\ell,r}^2]}{E[\hat q_\ell(1 - \hat q_\ell)\hat q_r(1 - \hat q_r)]},$$

respectively.

Remarks 2.5.
• Note that $\frac{E[D_{\ell,r}^2]}{E[q_\ell(1 - q_\ell)q_r(1 - q_r)]}$ does not depend on the particular allelic variant $\ell, r$ either.
Therefore it makes sense to write $\sigma_D^2$ instead of $\sigma_{D_{\ell,r}}^2$.

• We compute linkage disequilibrium during the sweep. So, if time is important, we will write
$D_{\ell,r}(t) := q_{\ell r}(t) - q_\ell(t)q_r(t)$, etc.
Pfaffelhuber et al. (2008) have computed $E[D_{\ell,r}(0) \mid D_{\ell,r}(T)]$ and $\sigma_D^2$ for a hard sweep, i.e.
the case $u_s = 0$. See the accompanying figure for a plot of $\sigma_D^2$ under neutrality and for $\theta_s = 0$ and $\theta_s = 0.1$.

• Naturally one would consider the quantity

$$E\Big[\frac{D_{\ell,r}^2}{q_\ell(1 - q_\ell)q_r(1 - q_r)}\Big].$$

But this quantity is less attractive for analytical studies, because it is mathematically difficult
to handle. However, see the recent paper of Song and Song (2007) for an analytical
computation of $r^2$ under neutrality.
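To make the sample quantities concrete, the following sketch computes $\hat D_{\ell,r}$ and the single-sample ratio $\hat D^2 / (\hat q_\ell(1-\hat q_\ell)\hat q_r(1-\hat q_r))$ from a list of two-locus haplotypes. The function name and the 0/1 coding of alleles are assumptions made for this illustration. Note that $\hat\sigma_D^2$ itself is a ratio of expectations over many realizations and is not computed here.

```python
def sample_ld(haplotypes):
    """Return (D_hat, r2) for haplotypes given as (l_allele, r_allele) pairs.

    Coding assumption for this sketch: 1 stands for the variants l and r,
    0 stands for the alternative variants L and R.
    """
    n = len(haplotypes)
    q_l = sum(1 for h in haplotypes if h[0] == 1) / n
    q_r = sum(1 for h in haplotypes if h[1] == 1) / n
    q_lr = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d_hat = q_lr - q_l * q_r
    denom = q_l * (1 - q_l) * q_r * (1 - q_r)
    r2 = d_hat * d_hat / denom if denom > 0 else float("nan")
    return d_hat, r2

# Two perfectly associated haplotype groups, as expected after a soft sweep
# with two founders and no recombination.
two_founders = [(1, 1)] * 5 + [(0, 0)] * 5
d_hat, r2 = sample_ld(two_founders)

# Recoding one locus flips the sign of D (Remark 2.3) but leaves D^2 unchanged.
recoded = [(h[0], 1 - h[1]) for h in two_founders]
d_flipped, r2_flipped = sample_ld(recoded)
```

The two-founder sample gives the maximal association, which is the haplotype-structure signal discussed in the introduction.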

2.3 Genealogies: Motivation 

We want to compute $E[D_{\ell,r}(0) \mid D_{\ell,r}(T)]$ and $\sigma_D^2$ at the end of the sweep, assuming small sample
sizes $n \ll N$ for the computation of $\hat\sigma_D^2$. We will use the one-to-one correspondence between the probability
to draw two pairs of heterozygous neutral loci and $\sigma_D^2$ (see step 3 of the proof of Theorem 3.2).
The probability to draw a heterozygous pair at the end of the sweep differs from the probability to
draw a heterozygous pair at the beginning of the sweep due to the change of the genealogy during
the sweep. We shall start with heterozygous pairs of a sample taken from the population at the
end of the sweep and follow the lines of the pairs back to the beginning of the sweep. In our notation
time is running backwards starting from the time $T$ of fixation, i.e. if $t_2 > t_1$ the time $t_2$ lies further
back in the past than the time $t_1$. For example, $E[D_{\ell,r}(0) \mid D_{\ell,r}(T)]$ is the expected value of $D_{\ell,r}$
at the end of the sweep given $D_{\ell,r}$ at the beginning of the sweep.

Figure 2: The figure on the left shows two lines which coalesce before they mutate. The figure on
the right shows two lines which mutate before they coalesce (further back in the past).

To define the genealogies of two neutral loci in the neighborhood of a selected locus in a Moran
model we would have to extend the Moran model as introduced in Section 2.1 to a full three-locus
model. However, multi-locus genealogies under such a Moran model are very complex. Under
certain conditions star-like genealogies approximate the genealogies of the Moran model quite well
and allow a computation of the above probabilities due to independent genealogical lines. In the
following we introduce such star-like genealogies and justify why it is reasonable to use them in
our setting.

We suppose that neutral mutations occur according to a Poisson process with rates of order
$O(1/N)$. Since the sweep takes only of order $\log(2Ns)/s$ time units, we can ignore neutral mutations
during the sweep. Moreover, back-mutations are rapidly sorted out as they have no fitness
advantage. Hence we will ignore back-mutations, too.



Coalescent and mutations to the beneficial allele 



• The rate of coalescence of two lines at time $t$ under the condition that the two lines are in
the beneficial background at time $t-$ and the frequency of the beneficial allele is $X_{t-} = \frac{i}{2N}$ is
equal to

$$\frac{\frac{1}{(2N)(2N-1)}\,\frac{i(i-1)}{2N}}{\frac{i(i-1)}{(2N)(2N-1)}} + \frac{\frac{1}{(2N)(2N-1)}\,\frac{(i-1)(2N-i+1)(1+s)}{2N}}{\frac{i(i-1)}{(2N)(2N-1)}} = \frac{(1+s)(2N+1)}{2Ni} - \frac{s}{2N}$$

for $i > 1$. The parents of the beneficial offspring are either both from the beneficial background
or one is from the wild-type and the second from the beneficial background; for similar
calculations see (Barton et al. 2004), Lemma 2.4. For large $N$ this rate is approximately

$$\frac{1+s}{2NX(t)}, \qquad (6)$$

since for large $N$ the frequency $X^N(t)$ is well approximated by the solution $X(t)$ of the
differential Equation (3) by Proposition 2.1.
Figure 3: Possible lines in the time interval $[T/2, T)$ (backward in time) for geometry (a). The
corresponding probabilities are given in Table 1.

• If an individual mutates to the beneficial type, the genealogical line of this individual jumps
(forward in time) from the wild-type background to the beneficial background. Backward
in time the line is located at time $t-$ in the beneficial background and at time $t$, after the
mutation event, in the wild-type background.

The rate of mutation to the beneficial background of a line at time $t$ under the condition
that the frequency of the beneficial allele is $X_{t-} = \frac{i}{2N}$ at time $t$ is equal to

$$u_s\,\frac{(2N - i + 1)/2N}{i/2N}.$$

An argument analogous to the one for the coalescence rate yields that this rate is approximately

$$\frac{u_s(1 - X(t))}{X(t)} \qquad (7)$$

for large $N$.



Both rates scale with $1/X(t)$, which means that the coalescence and mutation rates are high if the
frequency $X(t)$ is small. Hence it makes sense to assume that all mutations to the beneficial allele
and all coalescence events occur at time $t = T$, i.e. at the beginning of the sweep. This scaling of
the backward mutation rate shows that the star-like approximation, which has previously been used
for the classical hard sweep case, should also be appropriate for soft sweeps.

With the approximate mutation and coalescence rates (7) and (6) the probability for a hard
sweep in a sample of two can be bounded; see also (Pennings and Hermisson 2006a) for a similar
calculation in a Wright-Fisher-model formulation. The probability for a hard sweep of two lines
equals the probability that the coalescence event happens before the mutation event. The mutation
rate in a sample of two lines is approximately $2u_s\frac{1 - X(t)}{X(t)} = \theta_s\frac{1 - X(t)}{2NX(t)}$, if terms of order $(2Nu_s)^2$
are ignored.

Let $C = (C_t)_{0 \le t \le T}$, $M = (M_t)_{0 \le t \le T}$ be two independent Poisson processes with rates
$\lambda(t) = \frac{1+s}{2NX(T-t)}$ and $\mu(t) = \theta_s\frac{1 - X(T-t)}{2NX(T-t)}$, respectively. Denote by $S_1$ the first jump time of $C$ and by
$T_1$ the first jump time of $M$. Then the probability for a hard sweep in a sample of two is

$$p_{hard,2} := P(S_1 < T_1).$$







Figure 4: Same as Figure 3 for geometry (b). The lines (i), (ii) and (v) are not shown, as they
are the same as in Figure 3. The corresponding probabilities are given in Table 2.

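If the initial frequency of the beneficial allele $X(0) = \epsilon$ is small,

$$P(S_1 < T_1) = \int_0^T P(t < T_1)\,f_{S_1}(t)\,dt = \int_0^T \exp\Big(-\int_0^t \mu(r)\,dr\Big)\,\lambda(t)\,\exp\Big(-\int_0^t \lambda(r)\,dr\Big)\,dt$$

$$= \int_0^T \frac{1+s}{2NX(T-t)}\exp\Big(-\int_0^t \frac{1 + s + \theta_s(1 - X(T-r))}{2NX(T-r)}\,dr\Big)\,dt$$

$$= \frac{1+s}{1+s+\theta_s}\Big(\int_0^T \frac{1 + s + \theta_s(1 - X(T-t))}{2NX(T-t)}\exp\Big(-\int_0^t \frac{1 + s + \theta_s(1 - X(T-r))}{2NX(T-r)}\,dr\Big)\,dt$$

$$\qquad + \int_0^T \frac{\theta_s X(T-t)}{2NX(T-t)}\exp\Big(-\int_0^t \frac{1 + s + \theta_s(1 - X(T-r))}{2NX(T-r)}\,dr\Big)\,dt\Big)$$

$$\approx \frac{1+s}{1+s+\theta_s}\,(1 + 2u_s\bar T),$$

where $f_{S_1}(t)$ denotes the density of the probability measure induced by $S_1$ and $\bar T$ denotes the
expected time for the first coalescence or mutation event. In the penultimate expression the first
summand is the probability for a coalescence or a mutation event; this probability is approximately
1 for $\epsilon$ small. The second summand equals $\frac{\theta_s}{2N}\bar T = 2u_s\bar T$.

The time $\bar T$ lies approximately between 0 and the fixation time $T$ for $\epsilon$ small. So for small $s$
we can (approximately) bound the probability for a hard sweep in a sample of two by

$$\frac{1}{1+\theta_s} \le p_{hard,2} \le \frac{1}{1+\theta_s}\,(1 + 2u_sT).$$

Hence, the probability for a soft sweep in a sample of two can be (approximately) bounded by

$$\frac{\theta_s - 2u_sT}{1+\theta_s} \le 1 - p_{hard,2} \le \frac{\theta_s}{1+\theta_s}.$$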
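The competing-risks integral for $P(S_1 < T_1)$ can also be evaluated numerically. The sketch below is an illustration under stated assumptions rather than part of the original derivation: it uses the closed-form solution of Equation (3), the approximate rates $\lambda(t) = (1+s)/(2NX(T-t))$ and $\mu(t) = \theta_s(1-X(T-t))/(2NX(T-t))$, trapezoidal integration, and hypothetical parameter values with a small initial frequency $\epsilon = 1/(2N)$.

```python
import math

def x_forward(t, s, u_s, eps):
    """Closed-form solution of dX/dt = (u_s + s*X)*(1 - X) with X(0) = eps."""
    grow = (eps * s + u_s) * math.exp(s * t)
    decay = (s - eps * s) * math.exp(-u_s * t)
    return (s * grow - u_s * decay) / (s * (grow + decay))

def hard_sweep_probability(N, s, theta_s, eps, steps=20_000):
    """Numerically integrate lambda(t) * exp(-int_0^t (lambda + mu)) over [0, T]."""
    u_s = theta_s / (4 * N)
    ratio = (u_s + s * (1 - eps)) * (1 - eps) / (eps * (u_s + s * eps))
    T = math.log(ratio) / (s + u_s)          # time at which X reaches 1 - eps
    dt = T / steps
    hazard, p_hard = 0.0, 0.0
    prev_rate = prev_density = None
    for k in range(steps + 1):
        x = x_forward(T - k * dt, s, u_s, eps)   # frequency at backward time k*dt
        lam = (1 + s) / (2 * N * x)              # coalescence rate of the pair
        mu = theta_s * (1 - x) / (2 * N * x)     # mutation rate of the pair
        if prev_rate is not None:
            hazard += 0.5 * (prev_rate + lam + mu) * dt
        density = lam * math.exp(-hazard)
        if prev_density is not None:
            p_hard += 0.5 * (prev_density + density) * dt
        prev_rate, prev_density = lam + mu, density
    return p_hard, T, u_s

# Hypothetical parameter values, chosen only for illustration.
N, s, theta_s = 10_000, 0.1, 0.4
eps = 1 / (2 * N)
p_hard, T, u_s = hard_sweep_probability(N, s, theta_s, eps)
lower = (1 + s) / (1 + s + theta_s)
upper = lower * (1 + 2 * u_s * T)
print(f"{lower:.4f} <= p_hard = {p_hard:.4f} <= {upper:.4f}")
```

With these values the numerical result stays close to $(1+s)/(1+s+\theta_s)$, which for small $s$ reduces to the leading term $1/(1+\theta_s)$ of the bound.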



We can generalize this approach to obtain the (approximate) distribution of the number of
founders and offspring in a sample of size $n$. It is given by Ewens' sampling formula:

In a sample of size $n$ at time 0, the probability that there are $a_j$ founders of the sweep (with
respect to the selected locus) which have $j$ offspring, for $j \in \{1, \dots, n\}$, is given by Ewens' sampling
formula

$$\frac{n!}{\theta_{s(n)}}\prod_{j=1}^{n}\frac{\theta_s^{a_j}}{j^{a_j}\,a_j!}, \qquad (8)$$

where $\theta_{s(n)} = \theta_s \cdot (\theta_s + 1) \cdots (\theta_s + n - 1)$. See (Pennings and Hermisson, 2006b) for a derivation
of this formula in a Wright-Fisher-model formulation.

We will assume in our approximation of the genealogy that the number of founders and the
number of their offspring is Ewens distributed as given in Equation (8).

case | event | probability
(i) | no recombination event | $p_{SL}\,p_{LR}$
(ii) | an $LR$-recombination event makes the allele at the $R$-locus escape the sweep without the allele at the $L$-locus | $p_{SL}(1 - p_{LR})$
(iii) | by an $SL$-recombination event the line escapes the sweep and the alleles at the $L$- and $R$-locus stay linked | $(1 - p_{SL})\,p_{LR}$
(iv) | an $SL$-recombination event brings the alleles at the $L$- and $R$-loci linked into the wild-type background; here, the ancestry of both alleles is split by an $LR$-recombination | P[(iv) or (v)] $= (1 - p_{SL})(1 - p_{LR})$
(v) | an $LR$- and an $SL$-recombination event bring first the allele at the $R$-locus and then the allele at the $L$-locus into the wild-type background | (see (iv))

Table 1: Probabilities of several events happening between times $T/2$ and $T_-$ for geometry (a);
see Figure 3. All events are described backwards in time.
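Ewens' sampling formula is easy to evaluate directly. The following sketch (function names are chosen for this illustration only) computes the probability of a founder configuration $(a_1, \dots, a_n)$, checks that the probabilities of all configurations of a small sample sum to one, and recovers the value $1/(1+\theta_s)$ for a hard sweep in a sample of two.

```python
import math

def rising_factorial(theta, n):
    """theta * (theta + 1) * ... * (theta + n - 1)."""
    out = 1.0
    for k in range(n):
        out *= theta + k
    return out

def ewens_probability(a, theta):
    """Probability of the configuration a, where a[j-1] founders have j offspring."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    prob = math.factorial(n) / rising_factorial(theta, n)
    for j, a_j in enumerate(a, start=1):
        prob *= theta ** a_j / (j ** a_j * math.factorial(a_j))
    return prob

def configurations(n):
    """All tuples (a_1, ..., a_n) of non-negative integers with sum_j j*a_j = n."""
    def rec(j, remaining):
        if j > n:
            if remaining == 0:
                yield ()
            return
        for a_j in range(remaining // j + 1):
            for rest in rec(j + 1, remaining - j * a_j):
                yield (a_j,) + rest
    return list(rec(1, n))

theta_s = 0.4                                   # illustrative value
p_hard_2 = ewens_probability((0, 1), theta_s)   # one founder with two offspring
p_soft_2 = ewens_probability((2, 0), theta_s)   # two founders with one offspring each
total_5 = sum(ewens_probability(a, theta_s) for a in configurations(5))
```

For $n = 2$ the two configurations have probabilities $1/(1+\theta_s)$ and $\theta_s/(1+\theta_s)$, in agreement with the leading terms of the bounds derived for a sample of two.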



Recombination events 

Forward in time at a recombination event two lines merge into one. If a recombination event occurs
between two neighboring loci $L_1$ and $L_2$ (we will write $L_1L_2$-recombination event, for short), such
that $L_1$ lies on the left side of $L_2$, the offspring carries at all loci left of the locus $L_1$, including the
locus $L_1$, the alleles of the first parent and at the remaining loci the alleles of the second parent
(with $L_1, L_2 \in \{L, R, S\}$). Backward in time at a recombination event one line splits up into two
lines.

Since coalescence events occur at the beginning of the sweep, one can assume that each
recombination event affects only a single line. The probability for no recombination event in the
time interval $[t_1, t_2]$ is given by the probability that the first jump time of a Poisson process started
at time $t_1$ with rate $r(t)$ does not occur until time $t_2$. The rate $r(t)$ depends on the different kinds
of recombination events and is specified in the following.

• The frequency $X_t$ stays between backward time 0 and $T/2$ almost the whole time near 1 and is
certainly greater than 1/2. (The larger $\alpha$, the longer $X_t$ remains in a small neighborhood of 1.) So,
in the first half, recombination between the backgrounds is not frequent. Furthermore, if $L$, $R$
and $S$ are arranged according to geometry (a), $SL$-recombination events inside the beneficial
background cannot be seen in the DNA data. The only events that can be recognized in
the DNA data and occur at a non-negligible rate are $LR$-recombination events in the
beneficial background. If the loci are arranged according to geometry (b), all recombination
events in the beneficial background may be seen in the data.

The rate of recombination events between loci $L_1$ and $L_2$ (with $L_1, L_2 \in \{L, R, S\}$ for
geometry (b) and $L_1 = L$ and $L_2 = R$ for geometry (a)) in the beneficial background is
approximately $r_{L_1L_2}X(T-t)$ with $r_{L_1L_2} > 0$. Therefore the probability for no $L_1L_2$-recombination
event is given by

$$\exp\Big(-\int_0^{T/2} r_{L_1L_2}\,X(T-t)\,dt\Big). \qquad (9)$$

As long as $u_s$ is small, the differential Equation (3) is only a small perturbation of the
differential equation $\dot X(t) = sX(t)(1 - X(t))$. For this equation the integral in Equation (9) is
equal to $r_{L_1L_2}\log(\alpha)/s - r_{L_1L_2}\log(2)/s$. Since the second summand is small, we approximate
(9) by

$$p_{L_1L_2} := \exp(-\rho_{L_1L_2}\log(\alpha)/\alpha),$$

where $\rho_{L_1L_2} := 2N r_{L_1L_2}$ denotes the scaled recombination rate between the loci $L_1$ and $L_2$.

• In the time interval $[T/2, T]$ all recombination events with offspring in the beneficial background,
except recombination events inside the beneficial background, are likely to be seen in
the data. Similar arguments as above lead to the following assumption: The probability for
no recombination between locus $L_1$ and $L_2$ in the time interval $[T/2, T]$ is given by

$$p_{L_1L_2} := \exp(-\rho_{L_1L_2}\log(\alpha)/\alpha).$$

See Figures 3 to 5 for an illustration of the different types of recombination events possible in
the time intervals $[0, T/2)$ and $[T/2, T)$.

line | event | probability
(i) | no recombination event | $p_{LS}\,p_{SR}$
(ii) | an $SR$-recombination event makes the allele at the $R$-locus escape the sweep without the allele at the $L$-locus | $p_{LS}(1 - p_{SR})$
(iii) | an $LS$-recombination event makes the allele at the $L$-locus escape the sweep without the allele at the $R$-locus | $(1 - p_{LS})\,p_{SR}$
(iv) | an $LS$-recombination event followed by an $SR$-recombination event bring the alleles at the $L$- and $R$-locus into the wild-type background | P[(iv) or (v)] $= (1 - p_{LS})(1 - p_{SR})$
(v) | same as (iv) but in reverse order of the $LS$- and $SR$-recombination events | (see (iv))

Table 2: Probabilities for events happening between times $T/2$ and $T_-$ for geometry (b); see Figure
4. All events are described backward in time.
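The approximation $p_{L_1L_2} = \exp(-\rho_{L_1L_2}\log(\alpha)/\alpha)$ and the event probabilities of Table 1 translate directly into a few lines of code. This is a sketch under stated assumptions: the function names and the recombination rates are hypothetical, and only geometry (a) is shown.

```python
import math

def no_recombination_prob(r, N, s):
    """p = exp(-rho * log(alpha) / alpha) with rho = 2*N*r and alpha = 2*N*s."""
    rho = 2 * N * r
    alpha = 2 * N * s
    return math.exp(-rho * math.log(alpha) / alpha)

def geometry_a_events(p_SL, p_LR):
    """Event probabilities between T/2 and T_- for geometry (a), as in Table 1."""
    return {
        "(i)": p_SL * p_LR,
        "(ii)": p_SL * (1 - p_LR),
        "(iii)": (1 - p_SL) * p_LR,
        "(iv) or (v)": (1 - p_SL) * (1 - p_LR),
    }

# Hypothetical per-generation recombination rates, chosen only for illustration.
N, s = 10_000, 0.1
p_SL = no_recombination_prob(r=1e-3, N=N, s=s)
p_LR = no_recombination_prob(r=5e-4, N=N, s=s)
events = geometry_a_events(p_SL, p_LR)
for case, prob in events.items():
    print(case, round(prob, 4))
```

Since the four cases are exhaustive and mutually exclusive, their probabilities sum to one for any choice of $p_{SL}$ and $p_{LR}$.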

With this motivation, we can define an extended star-like genealogy: 
Figure 5: Possible split of two linked neutral loci: Two alleles at the neutral loci linked to the
beneficial allele either (i) have a common ancestor at time $T/2$ or (ii) have two different ancestors
that are both linked to a beneficial allele.

2.4 Genealogies: Definition 

The joint genealogy of two neutral loci in the neighborhood of the selected locus can be defined as a
structured partition-valued process. Denote by $\mathcal{S}_A := \{\xi_A : \xi_A \text{ partition of } A\}$ the set of partitions
of a set $A$. A partition $\xi_A = \{\xi_1, \dots, \xi_m\}$ is called finer than a partition $\tilde\xi_A = \{\tilde\xi_1, \dots, \tilde\xi_k\}$, iff for
each $j \in \{1, \dots, m\}$ there exists an $i \in \{1, \dots, k\}$ such that $\xi_j \subseteq \tilde\xi_i$. We write $\xi_A \preceq \tilde\xi_A$ if $\xi_A$ is finer than
$\tilde\xi_A$. Let $\xi = \{\xi_1, \dots, \xi_m\} \in \mathcal{S}_A$ and $\eta = \{\eta_1, \dots, \eta_k\} \in \mathcal{S}_{\{1, \dots, m\}}$, $m > k$, then the composition is

$$\eta \circ \xi := \Big\{\bigcup_{i \in \eta_1}\xi_i, \dots, \bigcup_{i \in \eta_k}\xi_i\Big\}.$$

A structured partition of $A$ is a tuple $(\xi_A^B, \xi_A^b)$ with $\xi_A^B \cup \xi_A^b \in \mathcal{S}_A$ and $\xi_A^B \cap \xi_A^b = \emptyset$. Partition
elements in $\xi_A^B$ ($\xi_A^b$) are called beneficial (wild-type). Denote by

$$\mathcal{S}_A^{B,b} := \{(\xi_A^B, \xi_A^b) \mid (\xi_A^B, \xi_A^b) \text{ is a structured partition of } A\}$$

the set of structured partitions of the set $A$. Elements of a structured partition $(\xi_A^B, \xi_A^b)$ are of the
form $(\xi_1, \xi_2)$ with $\xi_1 \in \xi_A^B$ and $\xi_2 \in \xi_A^b$.

Define $\ell := \{1, \dots, n\}$ the set of the $L$-loci and $r := \{n+1, \dots, 2n\}$ the set of the $R$-loci of a
sample of size $n$ of the population. We are interested in the structured partitions of $\ell \cup r$.

For geometry (a) the different kinds of recombination events can change the structured partition
$(\xi^B, \xi^b)$ to

• $(\xi^B \setminus \{\xi_k^B\},\ \xi^b \cup \{\xi_k^B\})$, if an $SL$-recombination event happens between a wild-type and a beneficial
line and the offspring carries at the selected locus the beneficial allele (thus at the $L$-
and $R$-locus the individual carries the alleles of the wild-type line).

• $((\xi^B \setminus \{\xi_k^B\}) \cup \{\xi_k^B \cap \ell\},\ \xi^b \cup \{\xi_k^B \cap r\})$, if an $LR$-recombination event happens between an
individual of the beneficial background and an individual of the wild-type background and
at the $S$- and $L$-locus the beneficial line is carried on (forward in time).

• $((\xi^B \setminus \{\xi_k^B\}) \cup \{\xi_k^B \cap \ell,\ \xi_k^B \cap r\},\ \xi^b)$, if an $LR$-recombination event happens between two
individuals of the beneficial background.

• $(\xi^B,\ (\xi^b \setminus \{\xi_k^b\}) \cup \{\xi_k^b \cap \ell,\ \xi_k^b \cap r\})$, if an $LR$-recombination event happens between two
individuals of the wild-type background.

• $(\xi^B \cup \{\xi_k^b\},\ \xi^b \setminus \{\xi_k^b\})$, if an $SL$-recombination event happens between a beneficial and a
wild-type line and the offspring carries the wild-type allele at the selected locus.

• $(\xi^B \cup \{\xi_k^b \cap r\},\ (\xi^b \setminus \{\xi_k^b\}) \cup \{\xi_k^b \cap \ell\})$, if an $LR$-recombination event happens between a
beneficial and a wild-type line and at the selected and $L$-locus the wild-type is carried on
(forward in time).

For geometry (b) the partitions change in an analogous manner.

Before we give the definition of an extended star-like genealogy we define genealogies and 
samples. 

Definition 2.6. The genealogy of a set $A$ is a four-time step Markov chain

$$(\xi_t)_{t \in \{0, T/2, T_-, T\}} = \big((\xi_0^B, \xi_0^b), (\xi_{T/2}^B, \xi_{T/2}^b), (\xi_{T_-}^B, \xi_{T_-}^b), (\xi_T^B, \xi_T^b)\big)$$

with state space $\mathcal{S}_A^{B,b}$.

A set $\ell \cup r$ with $\ell := \{1, \dots, n\}$ and $r := \{n+1, \dots, 2n\}$, $n \in \mathbb{N}$, is a sample at two loci $L$ and
$R$ taken from the population at time $t = 0$, if the genealogy of the sample is at time $t = 0$ given by
$\xi_0 = (\{\{1, n+1\}, \dots, \{n, 2n\}\}, \{\emptyset\})$.

With this we can define an extended star-like genealogy as a four-time step random experiment: 

Definition 2.7. An extended star-like genealogy of a sample $\ell \cup r$ with $\ell := \{1, \dots, n\}$ and $r :=
\{n+1, \dots, 2n\}$ at two loci $L$ and $R$ in the neighborhood of a selected locus $S$ arranged according to
geometry (a) (resp. geometry (b)) is a four-time step Markov chain

$$(\xi_t)_{t \in \{0, T/2, T_-, T\}} = \big((\xi_0^B, \xi_0^b), (\xi_{T/2}^B, \xi_{T/2}^b), (\xi_{T_-}^B, \xi_{T_-}^b), (\xi_T^B, \xi_T^b)\big)$$

with state space $\mathcal{S}_{\ell \cup r}^{B,b}$ with the following properties:

• At time $T/2$:

— Structured partition elements are stochastically independent

— No recombination events between the backgrounds and no mutations to the beneficial
allele, i.e. $P(\xi_{T/2}^b = \emptyset) = 1$

— No coalescence events, i.e. $P(\xi_{T/2}^B \preceq \xi_0^B) = 1$

— For geometry (a): No $LR$-recombination events in the beneficial background with probability
$p_{LR}$, i.e. for $j \in \ell$ a structured partition element at time $t = 0$ of the form
$(\{\{j, j+n\}\}, \{\emptyset\})$

* is kept at time $T/2$ with probability $p_{LR}$

* and changed to $(\{\{j\}, \{j+n\}\}, \{\emptyset\})$ with probability $1 - p_{LR}$

— For geometry (b): Neither an $LS$-recombination event nor an $SR$-recombination event
in the beneficial background happens with probability $p_{LS}\,p_{SR}$, i.e. for $j \in \ell$ a structured
partition element at time $t = 0$ of the form $(\{\{j, j+n\}\}, \{\emptyset\})$

* is kept till time $T/2$ with probability $p_{LS}\,p_{SR}$

* and changed to $(\{\{j\}, \{j+n\}\}, \{\emptyset\})$ with probability $1 - p_{LS}\,p_{SR}$

• At time $T_-$:

— Structured partition elements are stochastically independent

— No coalescence events, i.e. $P(\xi_{T_-}^B \cup \xi_{T_-}^b \preceq \xi_{T/2}^B) = 1$

— For geometry (a): For $j \in \ell$

* a partition element at time $T/2$ of the form $(\{\{j, j+n\}\}, \{\emptyset\})$
is kept at time $T_-$ with probability $p_{SL}\,p_{LR}$,
changed to $(\{\emptyset\}, \{\{j, j+n\}\})$ with probability $(1 - p_{SL})\,p_{LR}$,
changed to $(\{\{j\}\}, \{\{j+n\}\})$ with probability $p_{SL}(1 - p_{LR})$
and changed to $(\emptyset, \{\{j\}, \{j+n\}\})$ with probability $(1 - p_{SL})(1 - p_{LR})$.

* a partition element at time $T/2$ of the form $(\{j\}, \{\emptyset\})$
is kept at time $T_-$ with probability $p_{SL}$
and changed to $(\{\emptyset\}, \{j\})$ with probability $1 - p_{SL}$.

* a partition element at time $T/2$ of the form $(\{j+n\}, \{\emptyset\})$
is kept at time $T_-$ with probability $p_{SR}$
and changed to $(\{\emptyset\}, \{j+n\})$ with probability $1 - p_{SR}$.

— For geometry (b):

* A partition element at time $T/2$ of the form $(\{\{j, j+n\}\}, \{\emptyset\})$
is kept at time $T_-$ with probability $p_{LS}\,p_{SR}$,
changed to $(\emptyset, \{\{j\}, \{j+n\}\})$ with probability $(1 - p_{LS})(1 - p_{SR})$,
changed to $(\{\{j\}\}, \{\{j+n\}\})$ with probability $p_{LS}(1 - p_{SR})$
and changed to $(\{\{j+n\}\}, \{\{j\}\})$ with probability $(1 - p_{LS})\,p_{SR}$.

* A partition element at time $T/2$ of the form $(\{j\}, \{\emptyset\})$
is kept at time $T_-$ with probability $p_{LS}$
and changed to $(\{\emptyset\}, \{j\})$ with probability $1 - p_{LS}$.

* A partition element at time $T/2$ of the form $(\{j+n\}, \{\emptyset\})$
is kept at time $T_-$ with probability $p_{SR}$
and changed to $(\{\emptyset\}, \{j+n\})$ with probability $1 - p_{SR}$.

• At time $t = T$:

At the beginning of the sweep all coalescence and mutation events happen: Let $m \in \mathbb{N}$ and
$a_j \in \{0, \dots, m\}$ with $\sum_{j=1}^{m} j\,a_j = m$. Denote by $M_m := \{(\xi^B, \xi^b) \in \mathcal{S}_{\ell \cup r}^{B,b} : |\xi^B| = m\}$ the
set of structured partitions of $\ell \cup r$ whose beneficial partitions consist of $m$ elements and by
$N^{(a_1, \dots, a_m)} := \{\eta = (\eta_1, \dots, \eta_k) \in \mathcal{S}_{\{1, \dots, m\}} : \#\{\eta_i : |\eta_i| = j\} = a_j\}$ the set of partitions of
$\{1, \dots, m\}$ containing $a_j$ partition elements of size $j$. Then for $\eta \in N^{(a_1, \dots, a_m)}$

$$P\big(\xi_T = (\{\emptyset\}, (\eta \circ \xi^B) \cup \xi^b) \,\big|\, \xi_{T_-} = (\xi^B, \xi^b) \in M_m\big) = \frac{m!}{\theta_{s(m)}}\prod_{j=1}^{m}\frac{\theta_s^{a_j}}{j^{a_j}\,a_j!}. \qquad (10)$$

We say that a population evolved according to an extended star-like genealogy, if the genealogy of
each sample of the population is extended star-like.

Remark 2.8. If we are interested in the genealogy of a subset $M$ of a sample $\ell \cup r$, the genealogy
of $M$ fulfills all conditions of Definition 2.7. In particular, at time $t = T$ the number of founders
together with the number of their offspring is Ewens distributed, since Ewens' sampling formula is
consistent. At time $t = 0$, the genealogy of $M$ is given by

$$\xi_0 = (\{\{1, n+1\} \cap M, \dots, \{n, 2n\} \cap M\}, \{\emptyset\}).$$

In accordance with the possible recombination events during the time interval $[T/2, T)$ we obtain
the ancestral lines shown in Figure 3 for geometry (a) and Figure 4 for geometry (b). The
probabilities for these events are listed in Table 1 for geometry (a) and in Table 2 for geometry (b).
For the time interval $[0, T/2)$ the possible ancestral lines are shown in Figure 5. In Figure 2 the
left picture shows two lines which first coalesce and then mutate; in the right picture the lines
first mutate and coalesce afterwards.
3 Results



Our main result is the computation of the linkage disequilibrium at the end of the sweep, measured
by $E[D_{\ell,r}(0) \mid D_{\ell,r}(T)]$ for two fixed allelic variants $\ell$ and $r$ and by $\sigma^2_D$, for two neutral loci in a
neighborhood of the selected locus (backward in time).

We apply the procedure of Pfaffelhuber et al. (2008) to compute $E[D_{\ell,r}(0) \mid D_{\ell,r}(T)]$. The main
difference between our model and the hard sweep model is that two lines do not have to coalesce,
since both lines may mutate to the beneficial allele.

Theorem 3.1. Assume that the population evolved in a DNA region containing the two neutral
loci L and R and the selected locus S according to an extended star-like genealogy, and that both loci
carry exactly two allelic variants, $\ell/L$ and $r/R$. Then the linkage disequilibrium of the allelic
variants $\ell$ and $r$, measured by $E[D_{\ell,r}(0) \mid D_{\ell,r}(T)]$ at the end of the sweep, is given by

$$E[D_{\ell,r}(0) \mid D_{\ell,r}(T)] = p_{LR}^2 \left(1 - \frac{p_{SL}^2}{1 + \theta_s}\right) D_{\ell,r}(T), \tag{11}$$

if the two neutral loci are arranged according to geometry (a) and

$$E[D_{\ell,r}(0) \mid D_{\ell,r}(T)] = p_{LR}^2 \left(1 - \frac{1}{1 + \theta_s}\right) D_{\ell,r}(T) \tag{12}$$

for L and R arranged according to geometry (b).

Proof. Consider the genealogy $(\xi_t)_{t \in \{0, T/2, T_-, T\}}$ of an L-locus {1} and an R-locus {2}, i.e.
$\xi_t \in \mathcal{S}_{\{1,2\}}$. Denote by d the probability that the pair {1}, {2} was linked at the beginning of the
sweep if it is linked at the end of the sweep, i.e. let $d := P(\xi_T = (\{\emptyset\}, \{\{1, 2\}\}) \mid \xi_0 = (\{\{1, 2\}\}, \{\emptyset\}))$.
Analogously, denote by e the probability that the pair was linked at the beginning if it is
unlinked at the end of the sweep, i.e. $e := P(\xi_T = (\{\emptyset\}, \{\{1, 2\}\}) \mid \xi_0 = (\{\{1\}, \{2\}\}, \{\emptyset\}))$. Then
we can write

$$E[q_{\ell r}(0) \mid q_{\ell r}(T), q_\ell(T), q_r(T)] = d\, q_{\ell r}(T) + (1 - d)\, q_\ell(T)\, q_r(T),$$
$$E[q_\ell(0)\, q_r(0) \mid q_{\ell r}(T), q_\ell(T), q_r(T)] = e\, q_{\ell r}(T) + (1 - e)\, q_\ell(T)\, q_r(T)$$

with $q_{\ell r} \in [0, 1]$ and $q_r, q_\ell \in (0, 1)$, and so

$$E[D_{\ell,r}(0) \mid D_{\ell,r}(T)] = (d - e)\, D_{\ell,r}(T).$$

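The probabilities d and e are for geometry (a) and (b) given by

$$\text{a)}\quad e = \frac{1}{1 + \theta_s}\, p_{SL}\, p_{SR} = \frac{1}{1 + \theta_s}\, p_{SL}^2\, p_{LR}, \qquad \text{b)}\quad e = \frac{1}{1 + \theta_s}\, p_{LS}\, p_{SR} = \frac{1}{1 + \theta_s}\, p_{LR}$$

and

$$\text{a)}\quad d = e(1 - p_{LR}) + p_{LR}\, p_{LR}, \qquad \text{b)}\quad d = e(1 - p_{LR}) + p_{LR}\, p_{LS}\, p_{SR} = e(1 - p_{LR}) + p_{LR}^2.$$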

In words: a pair that is unlinked at the end of the sweep was linked at the beginning iff the
pair just coalesces, i.e. neither a recombination event between the S- and the L-locus nor a
recombination event between the S- and the R-locus occurred and the two loci coalesced before they
mutated. A pair that is linked at the end of the sweep is also linked at the beginning iff
either nothing happens, or the pair is first separated by an LR-recombination event and then linked
again by a coalescence event.

In both geometries $d - e = p_{LR}^2 - e\,p_{LR}$, from which Equation (11) follows for geometry (a) and Equation (12) for geometry (b). □
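The bookkeeping in the proof can be checked numerically. The following is a minimal sketch, assuming our reading of the proof: every p is a probability of no recombination, two lines share a founder with probability 1/(1 + θ_s), geometry (a) has the locus order S, L, R (so p_SR = p_SL · p_LR) and geometry (b) has the order L, S, R (so p_LR = p_LS · p_SR). The function name `sweep_ld_factor` and its argument names are illustrative and do not come from the paper.

```python
def sweep_ld_factor(theta_s, p_s_left, p_sr, p_lr, geometry):
    """Return (d, e, d - e) for one pair of neutral loci.

    p_s_left is p_SL in geometry (a) and p_LS in geometry (b).
    All p_* are probabilities of NO recombination (assumed reading).
    """
    # Probability that two lines trace back to the same founder.
    coalesce = 1.0 / (1.0 + theta_s)
    # Unlinked pair becomes linked: no escape of either locus, then coalescence.
    e = coalesce * p_s_left * p_sr
    if geometry == "a":        # order S - L - R
        stay_linked = p_lr
    elif geometry == "b":      # order L - S - R
        stay_linked = p_s_left * p_sr
    else:
        raise ValueError("geometry must be 'a' or 'b'")
    # Linked pair: either split by LR-recombination and re-linked (first term),
    # or never split (second term).
    d = e * (1.0 - p_lr) + p_lr * stay_linked
    return d, e, d - e


if __name__ == "__main__":
    theta_s, p_sl, p_lr = 0.1, 0.9, 0.8
    print(sweep_ld_factor(theta_s, p_sl, p_sl * p_lr, p_lr, "a"))
    print(sweep_ld_factor(theta_s, 0.9, 0.8, 0.9 * 0.8, "b"))
```

Under these assumptions the third return value is the factor multiplying D(T) in Equations (11) and (12); for θ_s = 0 in geometry (b) it vanishes, which corresponds to the hard sweep case.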



To compute the quantity $\sigma^2_D$ consider the three quantities

$$\begin{aligned}
X_t &:= E[q_L(t)(1 - q_L(t))\, q_R(t)(1 - q_R(t))], \\
Y_t &:= E[D(t)(1 - 2q_L(t))(1 - 2q_R(t))], \\
Z_t &:= E[(D(t))^2]
\end{aligned} \tag{13}$$

for $0 \le t \le T$.



Theorem 3.2. Given $X_T$, $Y_T$ and $Z_T$ at the beginning of the sweep and a sample of size n
of a population at the end of the sweep, i.e. a set of L-loci $\ell := \{1, \dots, n\}$ and a set of R-loci
$r := \{n + 1, \dots, 2n\}$, assume that the genealogy of the sample is extended star-like. Then the
standard linkage disequilibrium $\sigma^2_D$ of this sample of two neutral loci at the end of a sweep equals

$$\sigma^2_D = Z_0 / X_0 \tag{14}$$

with 

Zo = P 4 lr(psl - 1) 2 (p 2 sl(*t + y T ) + (1 + 2psl)Z t ) 

+ 9 s Psr[-^-(plr(Hpsr - 2 - 6p L R.) + psb(2 - 4p SR )) (15) 

+ j^(psr(2 - 21p S R) +Plr(38psr ~ 1$Plr - 2)) + — (9p% R - 9psrPlr+Psr 



and 



X = (1 -PSR.){PSL - l)(X T {l + p S R+ PSh) + {X T +yT){pSLPSR)) 

o] 
3 



; ' L (x T (3pl R ~ 5 PLR + 3 + 2 PsrPlr + 2 PSR - 4p 2 SR ) (16) 



17 21 

+ 5y T {pSRPLR + PSR ~ 2qPLr) + PSR.( Z T ~ -j^T) 



for geometry (a) and with 

'X- 



+ ^ (3p L R ~ PSR - PLS + 1) + ~Y?LR 



(17) 



Xo = X T ({1 - pls){\ - p% R )) + y T (pLR,(l ~ PLS)(1 - PSR)) 

+ 9 S (-^(Spls + 2plrPsr ~ 4p 2 LR - 5p LR + 2p LR p LS + 3p 2 SR ) (is) 

+ ^M(20p LS - 21 PLR - 17 + 20 PSR ) + ^-p 2 LR 
for geometry (b), if we ignore in both geometries terms of order $\theta_s^2$ and 1/n.

For a proof of this theorem see Section 6.
Remark 3.3. 

• In the supporting online material you find a Mathematica notebook for computing the exact
values of the standard linkage disequilibrium measured by $\sigma^2_D$ without ignoring terms of
order $\theta_s^2$ and 1/n.



• If the population evolves neutrally until the beginning of the sweep, Ohta and Kimura (1969)
have shown that

$$X_T = \frac{1}{4}\,\frac{\theta^2}{\theta + 1} \cdot \frac{(5 + 2\theta + \rho_{LR})(3 + 2\theta + 2\rho_{LR}) - 4}{(1 + \theta)(3 + 2\theta + 2\rho_{LR})(5 + 2\theta + \rho_{LR}) - 2(3 + 2\theta)},$$

$$Y_T = \frac{\theta^2}{\theta + 1} \cdot \frac{1}{(1 + \theta)(3 + 2\theta + 2\rho_{LR})(5 + 2\theta + \rho_{LR}) - 2(3 + 2\theta)},$$

$$Z_T = \frac{1}{4}\,\frac{\theta^2}{\theta + 1} \cdot \frac{2\theta + \rho_{LR} + 5}{(1 + \theta)(3 + 2\theta + 2\rho_{LR})(5 + 2\theta + \rho_{LR}) - 2(3 + 2\theta)},$$

where $\theta := 4Nu$ is the neutral mutation rate. For a comparison of the theoretical results
with simulations we assume that the population evolved neutrally until the beginning of the
sweep.
Figure 6: Theoretical values of $\sigma^2_D$ in the neutral setting, for $\theta_s = 0$ and $\theta_s = 0.1$. The distance
between the neutral loci is 0.2 kb, the selection strength $\alpha = 1000$, the population size $N = 10^6$,
the recombination rate between the neutral loci $\rho_{LR} = 5$ and the neutral mutation rate $\theta = 0.005$.

• See Figure 6 for a plot of the theoretical values of $\sigma^2_D$ for different values of $\theta_s$. Here we
also assumed neutral evolution until the beginning of the sweep.

• Note that for $\theta_s = 0$ we obtain $\sigma^2_D$ for a hard sweep, compare (Pfaffelhuber et al. 2008).
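The neutral starting values in Remark 3.3 are easy to evaluate numerically. The sketch below assumes our reading of the three expressions attributed to Ohta and Kimura (1969), with θ the scaled neutral mutation rate and ρ_LR the scaled recombination rate between the two neutral loci. The function name `neutral_moments` is illustrative and not part of the paper or its supporting material.

```python
def neutral_moments(theta, rho_lr):
    """Return (X_T, Y_T, Z_T) under neutrality, as read from Remark 3.3."""
    # Common denominator shared by all three expressions.
    denom = ((1 + theta) * (3 + 2 * theta + 2 * rho_lr)
             * (5 + 2 * theta + rho_lr) - 2 * (3 + 2 * theta))
    prefactor = theta ** 2 / (theta + 1)
    x = 0.25 * prefactor * (
        (5 + 2 * theta + rho_lr) * (3 + 2 * theta + 2 * rho_lr) - 4) / denom
    y = prefactor / denom
    z = 0.25 * prefactor * (2 * theta + rho_lr + 5) / denom
    return x, y, z


if __name__ == "__main__":
    # Parameter values taken from the caption of Figure 6.
    x, y, z = neutral_moments(theta=0.005, rho_lr=5.0)
    print("X_T =", x, "Y_T =", y, "Z_T =", z)
    print("neutral sigma^2_D = Z_T / X_T =", z / x)
```

Under this reading the three values satisfy the identities 4 Z_T = (5 + 2θ + ρ_LR) Y_T and (3 + 2θ + 2ρ_LR) Z_T = X_T + Y_T, which gives a quick internal consistency check on the formulas. The function requires θ > 0, since all three moments vanish at θ = 0 and the ratio is then undefined.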



4 Simulations 

We simulated sequence samples with the new program msms (for "ms mit Selektion", German for
"ms with selection") by Greg Ewing, see (Ewing and Hermisson 2010), to compare our theoretical linkage
disequilibrium values with linkage disequilibrium values obtained from simulated genealogies,
assuming neutral evolution until the beginning of the sweep. The program msms generates sequence
samples for a single selected locus of a population reproducing according to the Wright-Fisher
model with the possibility of recurrent mutation to the beneficial allele. The frequency of the
beneficial allele is simulated stochastically, conditioned on fixation. In Section 2 we argued that
the star-like genealogies approximate the Moran model genealogies well. For large population
sizes the Moran model and the Wright-Fisher model deliver similar genealogies if the parameters
are appropriately scaled. So instead of comparing the theoretical results with results obtained
from Moran model simulations, we can check the theoretical results against results obtained from
Wright-Fisher model simulations.

We consider a 5-kb stretch of DNA in a sample of n = 20 taken at the time of fixation of the
beneficial allele. We divide the stretch into 50 bins, each of length 0.1 kb, and measure LD between
SNPs of two different bins, averaged over $10^4$ draws. Figure 7 shows the results for recurrent
mutation rates $\theta_s = 0.1$ and $\theta_s = 0.5$, respectively. In both plots the neutral mutation rate is
$\theta = 0.005$, the recombination rate is $\rho = 0.025$ for two neighboring loci, the distance between
neighboring neutral loci L and R is 200 bp, the selection strength is $\alpha = 1000$ and the population
size is $N = 10^6$. These parameter values are realistic, for example, for Drosophila melanogaster samples.
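Measuring LD between SNPs from a sample of haplotypes can be sketched as follows. This is a minimal illustration, assuming haplotypes coded as 0/1 sequences and a ratio-of-sums estimator for σ²_D (sum of D² divided by the sum of p(1 − p)q(1 − q) over SNP pairs). The names `pair_ld` and `sigma2_d` are hypothetical, and the code is not the procedure used with msms in the paper.

```python
def pair_ld(haplotypes, i, j):
    """Return (D, p(1-p)q(1-q)) for the SNPs in columns i and j."""
    n = len(haplotypes)
    p = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at SNP i
    q = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at SNP j
    pq = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    return pq - p * q, p * (1 - p) * q * (1 - q)


def sigma2_d(haplotypes, pairs):
    """Ratio-of-sums estimate of sigma^2_D over the given SNP pairs."""
    numerator = 0.0
    denominator = 0.0
    for i, j in pairs:
        d, x = pair_ld(haplotypes, i, j)
        numerator += d * d
        denominator += x
    return numerator / denominator if denominator > 0 else float("nan")


if __name__ == "__main__":
    # Two perfectly distinct haplotype groups, as after a two-founder soft sweep.
    sample = [(1, 1)] * 10 + [(0, 0)] * 10
    print(sigma2_d(sample, [(0, 1)]))
```

In an analysis like the one described above, the sums would presumably run over all SNP pairs from two fixed bins and over all simulated draws before the ratio is taken.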

As we see in Figure 7, there is a good fit between simulated and theoretical values of $\sigma^2_D$. For
small sample sizes the extended star-like genealogies approximate the simulated Wright-Fisher
genealogies well. For both theoretical and simulated values, the linkage disequilibrium is high if
$\theta_s \neq 0$ and the distance to the selected locus is small, independently of the geometry of the
considered neutral loci. Due to recombination, linkage disequilibrium decreases with increasing
distance to the selected locus.

Figure 7: Left figure: Plot of $\sigma^2_D$ for a neutral mutation rate $\theta = 0.005$, recombination rate
$\rho = 0.025$, selection strength $\alpha = 1000$, recurrent mutation rate to the beneficial allele $\theta_s = 0.1$,
a distance of 200 bp between the two neutral loci and a DNA stretch of length 5 kb, based on
$10^4$ draws. Right figure: Plot of $\sigma^2_D$ with the same parameters as in the left figure except for the
recurrent mutation rate to the beneficial allele, $\theta_s = 0.5$.

Figure 8: The full linkage disequilibrium spectrum for a single sample of a soft sweep with two
founders with respect to the beneficial locus.

The differences between the theoretical and simulated values of $\sigma^2_D$ are due to the approximation
of the genealogy. The approach has three effects on LD.

First, a star-like genealogy assumes independent recombinants. In the simulated genealogies,
however, coalescence events may also occur before recombination events; in particular, early
recombinants may arise (see (Durrett and Schweinsberg 2004) or, for slightly different models,
(Etheridge et al. 2006)). For geometry (a) it is important that recombinants with offspring lead
to less "independent" variation, which can be seen in higher LD values of the simulated data.
For geometry (b) early recombinants become noticeable because they produce patterns similar
to soft sweep patterns: recombinants that spread through the population act as additional founders
of the sweep. For this reason our approximation of the genealogy assumes fewer founders of the
sweep than the simulated genealogies have. Therefore the LD patterns of simulated data should
look like the LD pattern of the theoretical values with a slightly higher $\theta_s$ value. Higher $\theta_s$ values
produce lower $\sigma^2_D$ in geometry (b); compare the pictures in Figure 7. This effect becomes negligible
for increasing $\theta_s$: on the one hand the fixation of the beneficial allele gets faster, on the other
hand for intermediate and high values of $\theta_s$ extended star-like genealogies also assume on average
more than two founders. By measuring linkage disequilibrium one can distinguish well between
the existence of one or two founders of the sweep, but not between the existence of three or four
founders.

Second, the star-like approximation of the genealogy is in general longer than the simulated
genealogy, since the beneficial allele spreads faster through the population if the lines are
dependent. Therefore, more recombination events are assumed to fall on the theoretical genealogies than
on the simulated genealogies. For loci in a small neighborhood of the selected locus this means
that more SNPs can be found for the star-like genealogy due to recombination. Third, SNPs of
simulated data are noisier, since they may also arise from neutral mutation during the sweep. In
geometry (a), both effects show up as higher theoretical LD values in a small neighborhood
of the selected locus. Beyond a certain distance, however, the effect reverses: more recombination brings
more "independent" variation into the sample, and the theoretical values of LD in geometry (a) lie
below the simulated values.

Often one is interested in the case of a single sample. We simulated a single sample of a
soft sweep with two founders and computed linkage disequilibrium from that data. The result is
plotted in Figure 8. In that case the pattern is very clear: in the neighborhood of the selected
locus, high linkage disequilibrium can be found independent of the geometry of the loci. However,
such clear patterns cannot be expected in general, even if the sweep has two founders. It is likely
that the number of offspring is not distributed equally between the founders. For example, it may
happen that in a sample of 20 individuals, with respect to the beneficial allele, 2 individuals are
offspring of one founder and the remaining 18 individuals are offspring of the second founder.
In such unbalanced cases stochastic effects caused by recombination and mutation easily destroy
the pattern.



5 Discussion 



Soft sweeps were introduced by Pennings and Hermisson in their series of papers (Hermisson
and Pennings 2005), (Pennings and Hermisson 2006a), (Pennings and Hermisson 2006b). They
argued that tests based on haplotype structure have high power to detect soft sweeps. Linkage
disequilibrium is a statistic sensitive to haplotype structure: if allelic variants are tightly linked to a
haplotype, LD is high for pairs of such alleles. We have seen that linkage disequilibrium under
a non-vanishing recurrent mutation rate differs substantially from linkage disequilibrium under
neutrality and hard sweeps, see Figure 6.

We computed $\sigma^2_D$ to understand the interplay of haplotype formation due to a soft sweep and
recombination. The main reason to compute $\sigma^2_D$ instead of $E[r^2]$ is its mathematical manageability.
However, former studies show (as does the present study) that $\sigma^2_D$ also measures what
is intuitively understood as linkage disequilibrium and makes it possible to distinguish between
different population genetic scenarios.

When a soft sweep occurs, recombination breaks up the linkage of loci due to haplotype
structure. Under hard sweeps recombination causes linkage of loci lying on one side of the selected locus.
In Figure 6 theoretical values of $\sigma^2_D$ are plotted for different values of $\theta_s$ and under neutrality. The
behavior can be explained roughly in the following manner:

For small values of $\theta_s$ we see for both geometries high values of $\sigma^2_D$ in a small neighborhood of the
selected locus, decreasing with increasing distance to the selected locus. If $\theta_s$ is relatively small,
Pennings and Hermisson (2006a) have shown that soft sweeps are not very likely; most sweeps
will be hard. LD of hard sweeps depends on recombination: only recombination brings the variation
into the sample that is necessary to compute linkage disequilibrium.

After a hard sweep we can see the following pattern of LD due to recombination. For geometry (a),
a recombination between the L-locus and the S-locus always includes a recombination between
the R-locus and the S-locus or between the L-locus and the R-locus, i.e. the L-locus does not recombine
independently of the R-locus. Therefore LD is high for geometry (a) after a hard sweep. In geometry
(b) an LS-recombination event does not cause an SR-recombination event and vice versa. So with
respect to recombination the L-locus is independent of the R-locus. Hence $\sigma^2_D$ is small. If the
sample is not finite, $\sigma^2_D$ is even zero, see Remark

If a soft sweep occurred, different founders of the sweep bring the variation into the sample;
recombination is not necessary. If there are exactly two founders and there exist loci with two
allelic variants such that one allelic variant is carried by one haplotype and the other variant by
the other haplotype, two such loci are tightly linked, and only recombination can break up this
linkage. Therefore, after a soft sweep with only a few founders LD is high in a small
neighborhood of the selected locus, independent of the geometry. But the more founders the soft
sweep has, the more variation in the sample is not linked to a single founder. This reduces LD.

For $r^2$ we expect, for very small values of $\theta_s$, patterns of LD similar to hard sweeps, because for
very small values of $\theta_s$ soft sweeps are rare. But $\sigma^2_D$ shows even for very small values of $\theta_s$ high
values in a small neighborhood of the selected locus. This comes from the fact that the small values
of $D^2$ expected after a hard sweep in geometry (b) have a smaller effect on the numerator of $\sigma^2_D$
than the higher values of $D^2$ expected after a soft sweep in geometry (b). An analogous statement
holds for the denominator of $\sigma^2_D$.

For biological studies the pattern of a single selective sweep is often of interest. After a soft
sweep we expect to find high LD between two neutral loci lying in a neighborhood of the selected locus,
but almost neutral levels of variation. Haplotype structure can be found, where each founder of the
sweep gives rise to one haplotype. Within each haplotype group a hard sweep occurred, i.e. almost no
variation can be found, with low LD for neutral loci lying on different sides of the selected locus and
high LD for loci lying on the same side of the selected locus. In Figure 8 simulation results of
a single sample of a soft sweep with two ancestors with respect to the selected locus are shown.
Tishkoff et al. (2007) likewise found a comparably clear linkage disequilibrium pattern of a soft
sweep when analyzing human lactase persistence in African and European populations.

An adaptation process may be initiated not only by mutation, but also through recurrent
migration or from standing genetic variation during an environmental change. A two-island model
with the beneficial allele fixed in one of the islands and migration from this island to the other
coincides with our model for recurrent mutation. A more realistic model assumes that the
beneficial allele is not fixed in both islands and that the allelic frequencies $q_L$, $q_R$, etc. do not coincide
on both islands. Unfortunately such (simple) modifications make the calculations in the proof of
Theorem 3.2, especially of the matrices A and B, quite complicated.

An improvement of the results could be made by approximating the genealogy not by a star-like
approximation but by a marked Yule process with immigration. It has been shown by Hermisson
and Pfaffelhuber (2008) that the joint genealogy of the population is better approximated by
these processes. However, explicit calculations become complicated with this approximation, since
recombination is not independent along lines during the sweep.



6 Proof of Theorem 3.2



We proceed in five steps. The quantities $X_t$, $Y_t$, $Z_t$ can be expressed in terms of pairwise
heterozygosities; Step 1 gives this connection. In Step 2 we show how pairwise heterozygosities are
transformed to sample heterozygosities. In Steps 3 and 4 we show how pairwise heterozygosities at time
t = T are transformed to pairwise heterozygosities at time t = 0. In Step 5 we collect everything together.

Step 1: Link between the pairwise heterozygosities $f_t, g_t, h_t$ and $X_t, Y_t, Z_t$

The quantities $X_t, Y_t, Z_t$ can be expressed in terms of probabilities for pairwise heterozygosities.

Denote for this purpose by $f_t$ the probability that two pairs heterozygous in both loci are
linked, by $g_t$ the probability that exactly one of the two pairs is linked and the other pair is
unlinked, and by $h_t$ the probability that both pairs are unlinked at time t. We can express these
probabilities in terms of structured partitions: Let $\ell_1, \ell_2$ be two L-loci and $r_1, r_2$ be two R-loci
taken from the population. Let $\xi_t = (\xi_t^B, \xi_t^b)$ be the genealogy of $\{\ell_1, \ell_2, r_1, r_2\}$ at time t, then

$$f_t = P(\xi_t \text{ is heterozygous and } \xi_t^B \cup \xi_t^b = \{\{\ell_1, r_1\}, \{\ell_2, r_2\}\}),$$

$$g_t = P(\xi_t \text{ is heterozygous and } \xi_t^B \cup \xi_t^b = \{\{\ell_1, r_1\}, \{\ell_2\}, \{r_2\}\}),$$

$$h_t = P(\xi_t \text{ is heterozygous and } \xi_t^B \cup \xi_t^b = \{\{\ell_1\}, \{r_1\}, \{\ell_2\}, \{r_2\}\}).$$

From an easy calculation (see for details also (Pfaffelhuber et al. 2008), Equation (A3)) it
follows that




Step 2: Link between pairwise heterozygosities $f, g, h$ and sample heterozygosities $\bar f, \bar g, \bar h$

Denote by $\bar f_t$, $\bar g_t$ and $\bar h_t$ the corresponding sample probabilities, i.e. $\ell_1, \ell_2 \in \ell$ and $r_1, r_2 \in r$.
It is possible to pick the same individual twice in a sample. Therefore the following relationships
hold:

& _(,_!) (!-?), + (!-!)!/, 

fi- ? Vi- i Wfi- i Wi- ? Wd- i, ! ? i/.. 



n J \ n J \ n J \ n 1 n \ n / \ n 1 nn 



Denoting the coefficient matrix of these relations by $F$, this is equivalent to $(\bar f_t, \bar g_t, \bar h_t)^T = F\,(f_t, g_t, h_t)^T$.
For example, two linked pairs, each consisting of one allele at the L-locus and one allele at the R-locus
and each taken at random (with replacement) from a sample, are heterozygous if we did not pick the
same individual twice and the resulting two different lines are heterozygous at both loci.

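In the next two steps we compute how to find $f_0$, $g_0$ and $h_0$ given $f_T$, $g_T$ and $h_T$, respectively.

Step 3: From $f_{T/2}, g_{T/2}$ and $h_{T/2}$ to $f_0, g_0$ and $h_0$, respectively

For both geometries we have

$$\begin{pmatrix} f_0 \\ g_0 \\ h_0 \end{pmatrix} = C \begin{pmatrix} f_{T/2} \\ g_{T/2} \\ h_{T/2} \end{pmatrix}$$

with

$$C = \begin{pmatrix} p_{LR}^2 & 2p_{LR}(1 - p_{LR}) & (1 - p_{LR})^2 \\ 0 & p_{LR} & 1 - p_{LR} \\ 0 & 0 & 1 \end{pmatrix}.$$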



Our model assumptions coincide with the model assumptions of Pfaffelhuber et al. (2008) in
the time interval [0, T/2), so that we obtain the same results here.
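The transition of Step 3 can be written out in a few lines. The sketch below assumes our reading of the matrix C: during [0, T/2) each linked pair independently stays linked with probability p_LR, so that C is upper triangular with rows (p_LR², 2p_LR(1 − p_LR), (1 − p_LR)²), (0, p_LR, 1 − p_LR) and (0, 0, 1). The helper names `step3_matrix` and `apply_matrix` are illustrative only.

```python
def step3_matrix(p_lr):
    """Matrix C mapping (f, g, h) at time T/2 to (f, g, h) at time 0."""
    return [
        [p_lr ** 2, 2 * p_lr * (1 - p_lr), (1 - p_lr) ** 2],  # both pairs linked at 0
        [0.0, p_lr, 1 - p_lr],                                 # one pair linked at 0
        [0.0, 0.0, 1.0],                                       # no pair linked at 0
    ]


def apply_matrix(matrix, vector):
    """Plain matrix-vector product for small lists."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


if __name__ == "__main__":
    c = step3_matrix(p_lr=0.8)
    # Hypothetical heterozygosity probabilities at time T/2, for illustration.
    print(apply_matrix(c, [0.02, 0.03, 0.04]))
```

Each row of C sums to one under this reading, because a configuration at time 0 must trace back to exactly one of the three configurations at time T/2.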

Step 4: From $f_T, g_T$ and $h_T$ to $f_{T/2}, g_{T/2}$ and $h_{T/2}$, respectively

For this time step it is important to note that attention has to be paid not only to the two
neutral loci, but also to the selected locus. We use the Ewens sampling formula (see Equation (10))
to compute the probabilities that the ancestral lines of the pairs share, with respect to the selected
locus, a common ancestor or have different ancestors. With this we get the following relationships:

$$\begin{pmatrix} f_{T/2} \\ g_{T/2} \\ h_{T/2} \end{pmatrix} = A \begin{pmatrix} f_T \\ g_T \\ h_T \end{pmatrix}$$

for geometry (a) and

$$\begin{pmatrix} f_{T/2} \\ g_{T/2} \\ h_{T/2} \end{pmatrix} = B \begin{pmatrix} f_T \\ g_T \\ h_T \end{pmatrix}$$

for geometry (b),

with matrix $A = (a_{ij})_{1 \le i,j \le 3}$ given by
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«n = -^t-tPlr + ^— t( 1 - Psl)pIr 



®s 2 L 

i 

«i2 = ^-^y 2pLi?(l - Plr) + 2(1 - PslIPlrQ - Plr) 

f ' T (i - PLR f + ^(l-p^a-m) 2 

2 
° 21 = (0 s + l)(0 s +2) PsiPSi?PM + (0 S + 1)(0 S + 2) (1 ~ 

° 22 = (^ + 1)^+2)^ + (^ + l)(^ + 2) (3 ^ - 3 ™*) + 

2 

+ (g +2 )(6> + i) PLfl ^ "P^X 1 + _ ^PslPsr + 2Psl) 

Q 

+ + l)(g +2) ^ ~ PLR ^ 1 -PSL + 1 - PS LPS R + 1 - PslPsr) + 
2 

+ Jq + 1 )(fl +2 ^ (1 -P5L)(1 +P5L - 2PSLPSR.) 

° 31 = ((9 a + l)(<?!+2)((9 a + 3)^ fl 
40 s 2 

° 32 - (0 s + l)(0 s+ 2)(0 s+ 3)™^ + 
20 

+ (0 S + 1)(0 S TWs + 3) (ps ^ Si(1 ~ Ps*Psl))+ 
A0 S 

+ (0 s + 1)(0 s +2)(0 s +3) P3LPsr{2 PSL Psr) + 

+ (9s + + Ws + 3) (4(1 " Psl){psl{1 PS*)PS*)) 

= 61 

° 33 ~ (0 s + l)(0 s +2)(0 s +3) + 

+ (fl , + i) (fl .+ 2 )(fl,+3) (4(1 - PSLPSR) + PSLPSL) + PshPsr))+ 
40 s 

+ (0 S + 1)(0 S + 2)(6> s +3) ^ 1 ~ PSJS: ^ 1 ~~ PSL ^ ~ PSR ) + ( X ~ - Psi?>SL)+ 

40 

+ (0 S + 1)(0 S +2)(0 S +3) (2(1 " PSR){1 PSL)PSR + Psl){1 ~ 
40 s 

+ (0 +1)(6> + 2)(0 +3) ( ' 2( - 1 ~ PsR ') psL ( 1 ~Psl) + (1 -Psl)(1 -P SjR ))+ 

+ ^ +1 ^ 6> ^ 2 ^g + 3 ^ (2(1 -PslPsr)^ -PslPsr) + (1 -PslPsl)(1-PsrPsr)) + 

+ (9s + Ws + 2)(0 S + 3) ((1 " PSL)(1 ~ PSR){1 + PSL + PSfl " 
and matrix $B = (b_{ij})_{1 \le i,j \le 3}$ given by
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r "8 2 2 

20 4 

612 = 7, * PsrPls(1- PlsPsr) + ^—^Pls(1 ~ Pls)PsrO- ~ Psr) 

o s + i p s + 1 

61 3 = a i 1 (! ~-PLgP5J?.) 2 + 

C/ s + 1 

+ fl 1 -i (1 -.PLS)(1 -PSfi)(2pLs(l -pSfl) + 2(1 ~ Pls)PSR + (1 -PLS)(1 ~ PSR.)) 
t) s + 1 

hl = {e s + i)\e s + 2/ Lsp2sR 
&22 - (e s + i)(e s + 2) PLsPsR+ 

a 

{PlsPsr{1 - PlsPsr) + (2 - psr - Pls)psrPls + PsrPlsO- ~ PlsPsr))+ 



g 

+ (0 s + l)(0 s +2) Pis(1 " Pis) ^ (1 " P5fl) 

b23= (e s + i)(e s + 2) {1 - psRPLs)+ 

Q 

+ (g s + i)(g s+2 ) ^ 1 ~ PLS ) 2 ^ ~ + 2 Pls(1 - Pls){1 - Psr))+ 

(psr(1 - Pls) 2 + (1 - PlsPsr) 2 + (1 - Psi?) 2 (l - Pls)) + 



(# s + l)(fl s +2) 



+ (0- + l)(fl t +2) ^ 2psJ ^ 1 ~ Psfl ^ 1 +Pls ( 1 ~Psr) 2 ) 

2 

(1 -Pis)(l -Psfl)(2pi,s(l -Psk)) 



(0 s + l)(0 s +2) 
2 

+ (0 s + l)(0 s +2) ( 2 ( 1 ~ PLS " >PsR + ( X -Psh)) 
$b_{31} = a_{31}$ with $p_{LS}$ instead of $p_{SL}$

$b_{32} = a_{32}$ with $p_{LS}$ instead of $p_{SL}$

$b_{33} = a_{33}$ with $p_{LS}$ instead of $p_{SL}$

To see the above equations, consider for example in geometry (a) the term $a_{21}$: In this case
at time T/2 there are two pairs which are heterozygous in both loci, and exactly one of the
pairs is linked. If the two pairs have two different ancestors with respect to the selected locus,
neither an SL-recombination nor an SR-recombination must happen for the unlinked pair, nor an
LR-recombination event for the linked pair. Therefore the probability to stay linked also at the
beginning of the sweep is $\frac{\theta_s}{(\theta_s + 1)(\theta_s + 2)}\, p_{SL}\, p_{SR}\, p_{LR}$, using the Ewens sampling formula. If the two pairs
have a single ancestor, the linked pair has to change backgrounds, i.e. an SL-recombination event
has to take place. Therefore we obtain in this case the probability $\frac{2}{(\theta_s + 1)(\theta_s + 2)}\,(1 - p_{SL})\, p_{SL}\, p_{LR}$.
The sum of these two probabilities gives $a_{21}$. The other terms can be explained in an analogous
manner.
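The founder probabilities entering the matrices A and B come from the Ewens sampling formula. As a minimal sketch, the standard probability of one specific set partition of n lines under the Ewens distribution with parameter θ_s is θ_s^k · ∏(|block| − 1)! divided by θ_s(θ_s + 1)···(θ_s + n − 1), where k is the number of blocks. The function name `ewens_partition_prob` is ours; it illustrates the formula and is not code from the paper.

```python
from math import factorial


def ewens_partition_prob(blocks, theta_s):
    """Probability of one specific set partition (list of blocks), theta_s > 0."""
    n = sum(len(block) for block in blocks)
    rising = 1.0
    for i in range(n):
        rising *= theta_s + i                  # theta (theta + 1) ... (theta + n - 1)
    prob = theta_s ** len(blocks) / rising     # one factor theta per founder
    for block in blocks:
        prob *= factorial(len(block) - 1)      # (size - 1)! per founder family
    return prob


if __name__ == "__main__":
    theta_s = 0.1
    # Two lines sharing one founder: 1 / (1 + theta_s).
    print(ewens_partition_prob([[1, 2]], theta_s))
    # Three lines, lines 2 and 3 share a founder, line 1 has its own.
    print(ewens_partition_prob([[1], [2, 3]], theta_s))
    # Three lines with a single founder.
    print(ewens_partition_prob([[1, 2, 3]], theta_s))
```

For three lines this gives θ_s/((θ_s + 1)(θ_s + 2)) for a fixed two-founder partition and 2/((θ_s + 1)(θ_s + 2)) for a single founder, which are the kinds of prefactors that appear in the entries of A and B.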



Step 5: Collecting everything together

We have

$$(X_0, Y_0, Z_0)^T = E \cdot F \cdot C \cdot A \cdot E^{-1}\,(X_T, Y_T, Z_T)^T$$

for geometry (a) and

$$(X_0, Y_0, Z_0)^T = E \cdot F \cdot C \cdot B \cdot E^{-1}\,(X_T, Y_T, Z_T)^T$$

for geometry (b).

With this we can compute $\sigma^2_D = Z_0 / X_0$ (recalling Equation (14)).

A calculation with Mathematica gives Equations (15)-(18), if terms of order $\theta_s^2$ and 1/n are
ignored. □